INTRODUCTION

This report will investigate and analyse the data set of policing from Dallas for the year of 2016. The analysis will have but not limited to exploration of data set, numerical representation of figures as well as the graphical presentation of the same. Exploration of data set will also focus on identification and dealing with the missing values and to identify which of the features are important in the data set.

This report is focused on finding answers to the questions such as

These research objectives just lay out the foundation but this report contains much more than these

Preliminary Analysis

Firstly, we will read the data set and will examine the number of features and observations we have. Also, we will look at the summary statistics of the data set and this will enable us in identification of missing values for each feature. Some missing values can be discarded using the complete case method whereas the important features cannot be discarded and can therefore be imputed with mean/median or through multiple imputations. The code below will read the data set from csv file and show the top 20 observation of the data set for each feature.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
setwd("C:/Users/Admin/Downloads")
policing <- read.csv('Prepped data.csv', header = TRUE)

Looking at the data set, we can see that the first row is useless as it contains the names of the features and can be removed. Therefore, we will remove the first row of each feature that contains the names of the features which we already have as variable names.

policing <- policing[-1,]

Further, we will look at the characteristics of the data set to check for data structure, dimensions and names. By running the summary, and Dim functions, we get to know that there we have 2383 observations in the data set and 47 different features. And, most of the feature’s data type is ‘Character’. However, we still have some features containing numerical values and we can differentiate between numerical and non-numerical features using R built in function.

dim(policing)
## [1] 2383   47

Before we proceed further to the analysis, we should verify that we have the necessary libraries installed and called for this specific project to avoid the errors that we may face due to unavailability of those required libraries.

library(dplyr)
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(tidyr)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: lattice
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.1.3
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(finalfit)
## Warning: package 'finalfit' was built under R version 4.1.3
library(scales)
library(ggpol)
## Warning: package 'ggpol' was built under R version 4.1.3

As we have looked through the macros of our data which includes the number of features, observations and data types. The data type “Character” will not allow us to proceed with the analysis because data contains some numeric values which are mistakenly classed as character but indeed they are not. Therefore, we will first assign the features proper classes that they actually belong to.

policing$OFFICER_ID <- as.integer(as.character(policing$OFFICER_ID))
policing$OFFICER_YEARS_ON_FORCE<- as.integer(as.character(policing$OFFICER_YEARS_ON_FORCE))
policing[, 12]     <- as.numeric(policing[, 12])
policing[, 20]     <- as.numeric(policing[, 20])
policing[, 21]     <- as.numeric(policing[, 21])
policing[, 22]     <- as.numeric(policing[, 22])
policing[, 32]     <- as.numeric(policing[, 32])
policing[, 33]     <- as.numeric(policing[, 33])
policing[, 5]     <- as.factor(policing[, 5])
policing[, 6]     <- as.factor(policing[, 6])
policing[, 9]     <- as.factor(policing[, 9])
policing[, 11]     <- as.factor(policing[, 11])
policing[, 13]     <- as.factor(policing[, 13])
policing[, 14]     <- as.factor(policing[, 14])
policing[, 15]     <- as.factor(policing[, 15])
policing[, 17]     <- as.factor(policing[, 17])
policing[, 18]     <- as.factor(policing[, 18])
policing[, 23]     <- as.factor(policing[, 23])
policing[, 24]     <- as.factor(policing[, 24])
policing[, 34]     <- as.factor(policing[, 34])
policing[, 35]     <- as.factor(policing[, 35])

we have the date feature in the data set which needs proper class categorisation. Therefore, we will give date feature the proper class and will add this column as an additional feature in the data set to keep the original data as it is. After adding the extra feature, now we have two features that are basically same making the count of total features to 48 but keeping both does not effect the consumption of time as well as computationally cost effective.

new_date <- as.Date(policing$INCIDENT_DATE, "%m/%d/%Y")

policing['new_date'] <- new_date

hiringbyy <- as.Date(policing$OFFICER_HIRE_DATE, "%m/%d/%Y")

policing['yearwisehiring'] <- year(hiringbyy)

dim(policing)
## [1] 2383   49

A new subset having numerical class is created which will ease the process of utilizing the features containing numerical values and for further processing to visualize the results.

Also, we will look at the distribution of the numeric data set using the histograms.

policing.numeric <- policing[,sapply(policing, is.numeric)]
dim(policing.numeric)
## [1] 2383    9
par(mfrow=c(3, 3))
colnames <- dimnames(policing.numeric)[[2]]
for (i in 1:8) {
  hist(policing.numeric[,i], main=colnames[i], probability=TRUE, col="gray", border="white")
}

Now, we will look at the percentage of the missing values in each of the feature which will enable us to make a decision on which of the features are redundant and can be removed to avoid the complexity in analysis as well as to decrease the computational cost.

sum(is.na.data.frame(policing))
## [1] 110
missing_glimpse(policing, dependent = NULL, explanatory = NULL, digits = 1)
##                                                                                     label
## INCIDENT_DATE                                                               INCIDENT_DATE
## INCIDENT_TIME                                                               INCIDENT_TIME
## UOF_NUMBER                                                                     UOF_NUMBER
## OFFICER_ID                                                                     OFFICER_ID
## OFFICER_GENDER                                                             OFFICER_GENDER
## OFFICER_RACE                                                                 OFFICER_RACE
## OFFICER_HIRE_DATE                                                       OFFICER_HIRE_DATE
## OFFICER_YEARS_ON_FORCE                                             OFFICER_YEARS_ON_FORCE
## OFFICER_INJURY                                                             OFFICER_INJURY
## OFFICER_INJURY_TYPE                                                   OFFICER_INJURY_TYPE
## OFFICER_HOSPITALIZATION                                           OFFICER_HOSPITALIZATION
## SUBJECT_ID                                                                     SUBJECT_ID
## SUBJECT_RACE                                                                 SUBJECT_RACE
## SUBJECT_GENDER                                                             SUBJECT_GENDER
## SUBJECT_INJURY                                                             SUBJECT_INJURY
## SUBJECT_INJURY_TYPE                                                   SUBJECT_INJURY_TYPE
## SUBJECT_WAS_ARRESTED                                                 SUBJECT_WAS_ARRESTED
## SUBJECT_DESCRIPTION                                                   SUBJECT_DESCRIPTION
## SUBJECT_OFFENSE                                                           SUBJECT_OFFENSE
## REPORTING_AREA                                                             REPORTING_AREA
## BEAT                                                                                 BEAT
## SECTOR                                                                             SECTOR
## DIVISION                                                                         DIVISION
## LOCATION_DISTRICT                                                       LOCATION_DISTRICT
## STREET_NUMBER                                                               STREET_NUMBER
## STREET_NAME                                                                   STREET_NAME
## STREET_DIRECTION                                                         STREET_DIRECTION
## STREET_TYPE                                                                   STREET_TYPE
## LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION
## LOCATION_CITY                                                               LOCATION_CITY
## LOCATION_STATE                                                             LOCATION_STATE
## LOCATION_LATITUDE                                                       LOCATION_LATITUDE
## LOCATION_LONGITUDE                                                     LOCATION_LONGITUDE
## INCIDENT_REASON                                                           INCIDENT_REASON
## REASON_FOR_FORCE                                                         REASON_FOR_FORCE
## TYPE_OF_FORCE_USED1                                                   TYPE_OF_FORCE_USED1
## TYPE_OF_FORCE_USED2                                                   TYPE_OF_FORCE_USED2
## TYPE_OF_FORCE_USED3                                                   TYPE_OF_FORCE_USED3
## TYPE_OF_FORCE_USED4                                                   TYPE_OF_FORCE_USED4
## TYPE_OF_FORCE_USED5                                                   TYPE_OF_FORCE_USED5
## TYPE_OF_FORCE_USED6                                                   TYPE_OF_FORCE_USED6
## TYPE_OF_FORCE_USED7                                                   TYPE_OF_FORCE_USED7
## TYPE_OF_FORCE_USED8                                                   TYPE_OF_FORCE_USED8
## TYPE_OF_FORCE_USED9                                                   TYPE_OF_FORCE_USED9
## TYPE_OF_FORCE_USED10                                                 TYPE_OF_FORCE_USED10
## NUMBER_EC_CYCLES                                                         NUMBER_EC_CYCLES
## FORCE_EFFECTIVE                                                           FORCE_EFFECTIVE
## new_date                                                                         new_date
## yearwisehiring                                                             yearwisehiring
##                                              var_type    n missing_n
## INCIDENT_DATE                                   <chr> 2383         0
## INCIDENT_TIME                                   <chr> 2383         0
## UOF_NUMBER                                      <chr> 2383         0
## OFFICER_ID                                      <int> 2383         0
## OFFICER_GENDER                                  <fct> 2383         0
## OFFICER_RACE                                    <fct> 2383         0
## OFFICER_HIRE_DATE                               <chr> 2383         0
## OFFICER_YEARS_ON_FORCE                          <int> 2383         0
## OFFICER_INJURY                                  <fct> 2383         0
## OFFICER_INJURY_TYPE                             <chr> 2383         0
## OFFICER_HOSPITALIZATION                         <fct> 2383         0
## SUBJECT_ID                                      <dbl> 2383         0
## SUBJECT_RACE                                    <fct> 2383         0
## SUBJECT_GENDER                                  <fct> 2383         0
## SUBJECT_INJURY                                  <fct> 2383         0
## SUBJECT_INJURY_TYPE                             <chr> 2383         0
## SUBJECT_WAS_ARRESTED                            <fct> 2383         0
## SUBJECT_DESCRIPTION                             <fct> 2383         0
## SUBJECT_OFFENSE                                 <chr> 2383         0
## REPORTING_AREA                                  <dbl> 2383         0
## BEAT                                            <dbl> 2383         0
## SECTOR                                          <dbl> 2383         0
## DIVISION                                        <fct> 2383         0
## LOCATION_DISTRICT                               <fct> 2383         0
## STREET_NUMBER                                   <chr> 2383         0
## STREET_NAME                                     <chr> 2383         0
## STREET_DIRECTION                                <chr> 2383         0
## STREET_TYPE                                     <chr> 2383         0
## LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION    <chr> 2383         0
## LOCATION_CITY                                   <chr> 2383         0
## LOCATION_STATE                                  <chr> 2383         0
## LOCATION_LATITUDE                               <dbl> 2328        55
## LOCATION_LONGITUDE                              <dbl> 2328        55
## INCIDENT_REASON                                 <fct> 2383         0
## REASON_FOR_FORCE                                <fct> 2383         0
## TYPE_OF_FORCE_USED1                             <chr> 2383         0
## TYPE_OF_FORCE_USED2                             <chr> 2383         0
## TYPE_OF_FORCE_USED3                             <chr> 2383         0
## TYPE_OF_FORCE_USED4                             <chr> 2383         0
## TYPE_OF_FORCE_USED5                             <chr> 2383         0
## TYPE_OF_FORCE_USED6                             <chr> 2383         0
## TYPE_OF_FORCE_USED7                             <chr> 2383         0
## TYPE_OF_FORCE_USED8                             <chr> 2383         0
## TYPE_OF_FORCE_USED9                             <chr> 2383         0
## TYPE_OF_FORCE_USED10                            <chr> 2383         0
## NUMBER_EC_CYCLES                                <chr> 2383         0
## FORCE_EFFECTIVE                                 <chr> 2383         0
## new_date                                       <date> 2383         0
## yearwisehiring                                  <dbl> 2383         0
##                                              missing_percent
## INCIDENT_DATE                                            0.0
## INCIDENT_TIME                                            0.0
## UOF_NUMBER                                               0.0
## OFFICER_ID                                               0.0
## OFFICER_GENDER                                           0.0
## OFFICER_RACE                                             0.0
## OFFICER_HIRE_DATE                                        0.0
## OFFICER_YEARS_ON_FORCE                                   0.0
## OFFICER_INJURY                                           0.0
## OFFICER_INJURY_TYPE                                      0.0
## OFFICER_HOSPITALIZATION                                  0.0
## SUBJECT_ID                                               0.0
## SUBJECT_RACE                                             0.0
## SUBJECT_GENDER                                           0.0
## SUBJECT_INJURY                                           0.0
## SUBJECT_INJURY_TYPE                                      0.0
## SUBJECT_WAS_ARRESTED                                     0.0
## SUBJECT_DESCRIPTION                                      0.0
## SUBJECT_OFFENSE                                          0.0
## REPORTING_AREA                                           0.0
## BEAT                                                     0.0
## SECTOR                                                   0.0
## DIVISION                                                 0.0
## LOCATION_DISTRICT                                        0.0
## STREET_NUMBER                                            0.0
## STREET_NAME                                              0.0
## STREET_DIRECTION                                         0.0
## STREET_TYPE                                              0.0
## LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION             0.0
## LOCATION_CITY                                            0.0
## LOCATION_STATE                                           0.0
## LOCATION_LATITUDE                                        2.3
## LOCATION_LONGITUDE                                       2.3
## INCIDENT_REASON                                          0.0
## REASON_FOR_FORCE                                         0.0
## TYPE_OF_FORCE_USED1                                      0.0
## TYPE_OF_FORCE_USED2                                      0.0
## TYPE_OF_FORCE_USED3                                      0.0
## TYPE_OF_FORCE_USED4                                      0.0
## TYPE_OF_FORCE_USED5                                      0.0
## TYPE_OF_FORCE_USED6                                      0.0
## TYPE_OF_FORCE_USED7                                      0.0
## TYPE_OF_FORCE_USED8                                      0.0
## TYPE_OF_FORCE_USED9                                      0.0
## TYPE_OF_FORCE_USED10                                     0.0
## NUMBER_EC_CYCLES                                         0.0
## FORCE_EFFECTIVE                                          0.0
## new_date                                                 0.0
## yearwisehiring                                           0.0

we have 110 total null values in the data set for the features that are important for the analysis where 55 are missing in the latitute and 55 are missing in longitude and these values make upto 4.6% total missing values for both the columns and such missing values are not significant and can be ignored. However, features with the names of type of force used 4 till 10 including the number_EC_cycles have majority of null values and they are redundant for the analysis. Therefore, deleting or ignoring such features is an appropriate way to deal with it.
In the last we will remove the duplicate rows from the data, if there is any. And the conditions will be based on multiple features containing date, time, UOF number and officer ID. This means that the few entries were duplicate and should be removed before further processing. Now, we have 2328 observations and 49 features because of 2 extra feature added for analysis.

policing <- policing[!duplicated(policing[c(1,2,3,4)]),]
dim(policing)
## [1] 2328   49

Numerical & Graphical Analysis

There are different type of visualizations which are but not limited to bar plots, histograms, scatter plots and the data set is about the officers, their ethnicity, sex group, number of years served, arrests made and about subjects, their crimes, sex group and ethnicity etc.
Before, we proceed for visualization, lets have the subset of the main data set that has unique values for the number of officers and subjects.

new_date <- as.Date(policing$INCIDENT_DATE, "%m/%d/%Y")

policing['new_date'] <- new_date

hiringbyy <- as.Date(policing$OFFICER_HIRE_DATE, "%m/%d/%Y")

policing['yearwisehiring'] <- year(hiringbyy)

xgen <- distinct(policing, OFFICER_ID, OFFICER_RACE, .keep_all= TRUE)

Finding the Correlation between the data set

require(ggcorrplot)
## Loading required package: ggcorrplot
## Warning: package 'ggcorrplot' was built under R version 4.1.3
corr.dataset = policing.numeric %>% select(SECTOR, BEAT, REPORTING_AREA, SUBJECT_ID, OFFICER_ID, OFFICER_YEARS_ON_FORCE)
corr = cor(corr.dataset)
corr.p = cor_pmat(corr)

ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, lab_size = 3,
                outline.col = "black", ggtheme = ggplot2::theme_bw(),
                colors = c("yellow", "green", "blue"))

We can see that the correlation plot to identify that “officer_id” and “officer_years_on_force” has a negative correlation with the coefficient value of -0.92 whereas “beat” & “sector” feature has a positive correlation with the value of 1.

Numerical representation of the Gender Distribution of officers

table(policing$OFFICER_GENDER)
## 
## Female   Male 
##    236   2092

Numerical representation of the Unique Gender Distribution of officers

distinctvals <- policing %>%
  group_by(OFFICER_GENDER) %>%
  summarise(n=n_distinct(OFFICER_ID))
distinctvals
## # A tibble: 2 x 2
##   OFFICER_GENDER     n
##   <fct>          <int>
## 1 Female           126
## 2 Male             915

Graphical representation of the Gender Distribution of officers

gender = ggplot(data = policing, aes(x=OFFICER_GENDER))
gender1 = gender + geom_bar(fill="#fb8072", alpha=0.4) +
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "Blue") +
   xlab("GENDER") + ylab("Number of Officers")
gender2 = gender1 + theme_bw() + theme(axis.title = element_text(size = 12, color = "blue"), axis.text = element_text(size = 10, face = "bold"))
gender2

we have 2092 male officers and 236 female officers. But this count is without removing the duplicates. Indeed, the real figures of the officers are plotted below:

Numerical representation of the Unique Gender Distribution of officers

distgenderplot <- policing %>%
  group_by(OFFICER_GENDER) %>%
  summarise(n=n_distinct(OFFICER_ID)) %>% 
  ggplot(.,aes(OFFICER_GENDER, n)) + geom_bar(stat = 'identity', fill="#fb8072", alpha=0.5)  + ylim(0,1000) +
  xlab("Gender Segregation") + ylab("Number of Officers")
distgenderplot1 = distgenderplot + theme_bw() + theme(axis.title = element_text(size = 12, color = "blue"), axis.text = element_text(size = 10, face = "bold"))
distgenderplot1

Actually, we have 126 female officers and 917 male officers.

Plotting the numbers of years of service

Lets have a look at the number of years officers served.

(hist(x=xgen$OFFICER_YEARS_ON_FORCE, breaks = 4, ylim = c(0,1000), col = 2:6, main = "Yearly Grouped Segregation of number of years served", ylab = "Number of officers", xlab = "No of Years", border = "White", labels = TRUE))
## $breaks
## [1]  0 10 20 30 40
## 
## $counts
## [1] 729 167 128  19
## 
## $density
## [1] 0.069894535 0.016011505 0.012272291 0.001821668
## 
## $mids
## [1]  5 15 25 35
## 
## $xname
## [1] "xgen$OFFICER_YEARS_ON_FORCE"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
text(5, 325, "Majority", srt = 90)

We can see that 729 officers have 0 to 10 years of experience which makes them the majority in the total group of officers whereas on the otherside, police department has very less officers that are experienced and have over 20 years of experience and their count is only 147. If we talk about the officers that have 30 to 40 years of service, it can be predicted that they are at the end of their tenure and will soon be retiring. Also, such officers are well placed in office and may not take part in ground activities.
Lets have another look at the same thing with the interactive plot, where we can see the specific number of officers with respect to their years of service.

yearsserved = ggplot(data=xgen, aes(x=OFFICER_YEARS_ON_FORCE))
noofyears = yearsserved + geom_histogram(binwidth = 1, fill="#2b8cbe", alpha=0.6) +
  xlab("Years Served") + ylab("Number of officers") + xlim(0,40) + ylim(0,120) 
ggplotly(noofyears)

Plotting the ethnicity with respect to genders

Lets have a look at how many different ethnicity officers we have and the gender of those officers with in those ethnical groups

racegender <- table(xgen$OFFICER_GENDER, xgen$OFFICER_RACE)
racegender
##         
##          American Ind Asian Black Hispanic Other White
##   Female            1     5    29       26     3    62
##   Male              6    22   149      204     6   530
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 4.1.3
xgen <- distinct(policing, OFFICER_ID, OFFICER_RACE, .keep_all= TRUE)
racegender1 <- table(xgen$OFFICER_GENDER, xgen$OFFICER_RACE)
xgender <- barplot(as.matrix(t(racegender1)), beside = TRUE, main = "Ethnic segregation of genders", ylab = "Number of Officers", xlab = "Gender",
                   legend.text = TRUE,
                   args.legend=list(bty="7", x = "topleft" , ncol = 3), ylim = c(0,800),
                   col=brewer.pal(7,"Set1"), border="Black")
ygender<-as.matrix(t(racegender1))
text(xgender, ygender+50, labels = as.character(ygender))

Plotting the division & district wise Distribution of Dallas Police

Now we will have a look at the number of distribution that the dallas police have.They have actually distributed the dallas in two ways that are division and location. Lets have a look at number of locations and divisions for the same.

distgen <- xgen %>%
  group_by(DIVISION) %>% summarise(count=n_distinct(OFFICER_ID))
distgen
## # A tibble: 7 x 2
##   DIVISION      count
##   <fct>         <int>
## 1 CENTRAL         197
## 2 NORTH CENTRAL   132
## 3 NORTHEAST       146
## 4 NORTHWEST       114
## 5 SOUTH CENTRAL   126
## 6 SOUTHEAST       162
## 7 SOUTHWEST       165
distgen <- xgen %>%
  group_by(DIVISION) %>% summarise(count=n_distinct(OFFICER_ID))

distgenplot <- plot(distgen$DIVISION, distgen$count, main = "Division wise count of officers", col = 2, xlab = "DIVISONS", ylab = "Number of officers")

distgenplot
## $stats
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]  197  132  146  114  126  162  165
## [2,]  197  132  146  114  126  162  165
## [3,]  197  132  146  114  126  162  165
## [4,]  197  132  146  114  126  162  165
## [5,]  197  132  146  114  126  162  165
## 
## $n
## [1] 1 1 1 1 1 1 1
## 
## $conf
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]  197  132  146  114  126  162  165
## [2,]  197  132  146  114  126  162  165
## 
## $out
## numeric(0)
## 
## $group
## numeric(0)
## 
## $names
## [1] "CENTRAL"       "NORTH CENTRAL" "NORTHEAST"     "NORTHWEST"    
## [5] "SOUTH CENTRAL" "SOUTHEAST"     "SOUTHWEST"

Analysis of the divisions and districts

Lets have a look at number of locations in a single division.

length(table(policing$LOCATION_DISTRICT))
## [1] 14
locindiv <- policing %>% group_by(DIVISION) %>% summarise(noofdistricts = n_distinct(LOCATION_DISTRICT)) %>% arrange(desc(noofdistricts))
locindiv
## # A tibble: 7 x 2
##   DIVISION      noofdistricts
##   <fct>                 <int>
## 1 NORTHEAST                 6
## 2 CENTRAL                   5
## 3 SOUTHWEST                 5
## 4 NORTHWEST                 4
## 5 SOUTH CENTRAL             4
## 6 SOUTHEAST                 4
## 7 NORTH CENTRAL             3

As the tables above shows that dallas has 7 divisions and 14 districts. And, northeast division is dealing with large number of districts for the policing in dallas city as well as handling with criminal offences.

DIVISIONS <- locindiv$DIVISION
clrgenre <- c('#8dd3c7','#ffffb3','#bebada','#fb8072','#80b1d3','#fdb462','#b3de69', '#000000')
ggplot(locindiv, aes(x=noofdistricts, fill=DIVISIONS)) + geom_histogram(binwidth=1) + scale_fill_manual(values = clrgenre) +
  labs(x = "Number of Districts",
       y = "Number of Divisions Covering Districts",
       title = "Graph Visualizing number of districts covered by divisions")

The graph above shows that north central is dealing with 3 districts whereas southeast, south central and northwest are dealing with 4 districts. There is one division that is dealing with 6 districts and it is “North East”.

Officers Years of Experience and their ##Hospitalization ###and ##Injury ###analysis

Lets have a look at the officers getting injured and hospitalized in accordance with their years of service.

ggplot(policing, 
       aes(x = OFFICER_YEARS_ON_FORCE, 
           fill = OFFICER_INJURY)) +
  geom_density(alpha = 0.4) +
  labs(title = "Years of experience distribution by officer injury")

The graph above shows that the experience with less experience are injured the most and the officers with more experience gets less injuries.

library(ggplot2)
ggplot(policing, 
       aes(x = factor(OFFICER_HOSPITALIZATION,
                      labels = c("Not Hospitalized",
                                 "Hospitalized")), 
           y = OFFICER_YEARS_ON_FORCE, 
           fill=OFFICER_HOSPITALIZATION)) +
  geom_boxjitter(color="black",
                 jitter.color = "darkgrey",
                 errorbar.draw = TRUE) +
  labs(title = "Officer Hospitalization by Years of Experience", 
       subtitle = "9-month salary for 2008-2009",
       x = "Hospitalization Status",
       y = "Years of Experience") +
  theme_minimal() +
  theme(legend.position = "none")

Same is the case with hospitalization as it can be seen from the graph above that the officers with less than 10 years of experience are hospitalized more often as compared to the officers with experience of more than 10 years.We can have another look on officer getting injured and getting hospitalized or maybe it was just an injury but they were not hospitalized.

datenew <- policing
datenew1 <- datenew %>% mutate(hireyear = mdy(OFFICER_HIRE_DATE)) %>%
  mutate(hyear = year(hireyear))
library(forcats)
xgennew <- datenew1
levels(datenew1$OFFICER_HOSPITALIZATION) <- list(NotHospitalized = "No", Hospitalized = "Yes")
levels(datenew1$OFFICER_INJURY) <- list(NotInjured = "No", Injured = "Yes")

G = ggplot(xgennew, aes(x=hyear, y =OFFICER_YEARS_ON_FORCE)) + geom_boxplot(aes(group = hyear))                    
(G = G + facet_grid(OFFICER_HOSPITALIZATION ~ OFFICER_INJURY) + labs(title = "Officer Injury & Hospitalization", 
       x = "Hiring Year",
       y = "Years of Experience"))

The graph above shows the officers population who is hospitalized when injured or when not injured.

Now lets have a look at the officers who made arrests and who did not and this is with respect to their years of experience and the year hired.

Officers arrest status with respect to their years of experience

datenew1 <- datenew1 %>% mutate(hiringbyyear= case_when(
  hyear >= 2000 & hyear <= 2010  ~ "2001 to 2010",
  hyear > 1978 & hyear < 2000 ~ "till 1990",
  TRUE ~ "after 2010")
)

G = ggplot(datenew1, aes(hyear, OFFICER_YEARS_ON_FORCE, color = hiringbyyear)) + geom_point()
G = G + scale_color_manual(values = c("blue", "yellow", "red"))

New_names <- c(`Yes` = "Made Arrest", `No` = "Did Not Make Arrest")
arrestsmade <- G + facet_wrap(~ SUBJECT_WAS_ARRESTED, labeller = as_labeller(New_names)) + labs(title = "Officer Arrest Status", 
       subtitle = "Categorized Hiring Date & their arrest record",
       x = "Hiring Year",
       y = "Years of Experience")
arrestsmade

Numerical representation of distinct number of crimes recorded

distcrimes <- policing %>%
  summarise(n=n_distinct(SUBJECT_ID))
distcrimes
##      n
## 1 1433

So there are 1433 crimes that are reported in the year 2016.

Month-wise distribution of crime records

Lets segregate the crimes reported in each month for 2016.

new_date_crime <- as.Date(policing$INCIDENT_DATE, "%m/%d/%Y")
policing['crime_date'] <- new_date_crime
policing['crime_month'] <- month(new_date_crime)



(hist(x=policing$crime_month, breaks = 8, ylim = c(0,1000), col = 2:6, main = "Monthly count of crimes reported", ylab = "Number of crimes", xlab = "Months", border = "White", labels = TRUE))
## $breaks
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
## 
## $counts
##  [1] 474 259 223 215 184 179 182 201 164 147 100
## 
## $density
##  [1] 0.20360825 0.11125430 0.09579038 0.09235395 0.07903780 0.07689003
##  [7] 0.07817869 0.08634021 0.07044674 0.06314433 0.04295533
## 
## $mids
##  [1]  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5
## 
## $xname
## [1] "policing$crime_month"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
text(1.5, 200, "High Count", srt = 90)

The graph above clearly shows that the crimes reported in the month of January are very high. There is another graph which visualizes the same thing in an appropriate way.

datenew <- policing

library(lubridate)
library(DT)
## Warning: package 'DT' was built under R version 4.1.3
datenew <- policing %>%
  mutate(new_inc_date = mdy(INCIDENT_DATE)) %>%
  mutate(CRIMEMONTH = month(new_inc_date, label =TRUE)) %>% 
  mutate(CRIMEDAY = day(new_inc_date)) %>% 
  mutate(CRIMEWEEK = wday(new_inc_date, label = TRUE))

datenew %>%
 count(CRIMEMONTH) %>%
  mutate(`Number of Crime` = n) %>%
  select(CRIMEMONTH, `Number of Crime`)%>% 
  DT::datatable(colnames = c("Month", "Number of Crimes"), 
                options = list(pageLength = 12))
library(plotly)
monthly <-datenew %>%
 count(CRIMEMONTH) %>%
  mutate(`Number of Crime` = n) %>% 
  ggplot() +
  aes(x = CRIMEMONTH, weight = `Number of Crime`) +
  geom_bar(fill = "green", width = 0.3) +
  labs(
    x = "Month\n",
    y = "\nNumber of Crime\n",
    title = "Number of crime by Month",
    subtitle = "Year:2016"
  ) +
  theme_light() ->m

  ggplotly(
    m,
    tooltip = c("CRIMEMONTH", "Number of Crime")
    )

The graph above is the interactive plot showing the number of crimes that are reported in each month.

Gender distribution involved in committing crimes

subgen <- policing %>% 
  group_by(SUBJECT_GENDER) %>%
  summarise(n=n_distinct(SUBJECT_ID))


subgen$fraction <- subgen$n / sum(subgen$n)


subgen$ymax <- cumsum(subgen$fraction)


subgen$ymin <- c(0, head(subgen$ymax, n=-1))


subgen$labelPosition <- (subgen$ymax + subgen$ymin) / 2


subgen$label <- paste0(subgen$SUBJECT_GENDER, "\n value: ", subgen$n)

newsubgen <- subgen[-c(3, 4), ]


(ggplot(newsubgen, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=SUBJECT_GENDER)) +
  geom_rect() +
  geom_text( x=5.2, aes(y=labelPosition, label=label, color=SUBJECT_GENDER), size=6) + # x here controls label position (inner / outer)
  scale_fill_brewer(palette=5) +
  scale_color_brewer(palette=5) +
  coord_polar(theta="y") +
  xlim(c(-1, 4)) +
  theme_void() +
  theme(legend.position = "none") +
  ggtitle("No of Female and Male reported for crime"))

So, we have 1170 males and 260 females who are reported for crimes.

Ethnicity distribution of individuals involved in crime

Lets have a look at crimes segregated by ethnicity.

require(rAmCharts)
## Loading required package: rAmCharts
## Warning: package 'rAmCharts' was built under R version 4.1.3
## Full amcharts.js API available using amChartsAPI()
## Look at rAmCharts::runExamples() & http://datastorm-open.github.io/introduction_ramcharts/
## Bug report or feed back on https://github.com/datastorm-open/rAmCharts
## 
## Attaching package: 'rAmCharts'
## The following object is masked from 'package:plotly':
## 
##     api
library(dplyr)
piedata <- policing %>% 
  group_by(SUBJECT_RACE) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(percentage = `n` / sum(`n`)) %>% 
  arrange(percentage) %>%
  mutate(labels = scales::percent(percentage))

piedata %>% dplyr::select(label=SUBJECT_RACE, value = n) %>%
     amPie(., inner_radius = 25, 
         depth = 6, 
         show_values = TRUE,
         legend = TRUE,
         export = TRUE,
         main = "Percent of criminal by Race")

Numerical analysis of criminals injured, arrested and injured when arrested

Lets have a look at the arrests of subjects. First, we will look at the total subjects that were arrested. Then we will look at the total subjects that were injured. And finally, we will have a look at subjects that were arrested and also injured. And, a numerical analysis is used here.

table(policing$SUBJECT_WAS_ARRESTED)
## 
##   No  Yes 
##  317 2011
table(policing$SUBJECT_INJURY)
## 
##   No  Yes 
## 1704  624

The tables above show that out of the total policing records of 2016, 2011 were arrested and 317 were not arrested. And according to the complete record of 2016, 624 were injured and 1704 were not injured. Now lets have a look at the number of injured criminals who were arrested.

arrtoinj <- policing %>% filter(SUBJECT_WAS_ARRESTED == 'Yes') %>% group_by(SUBJECT_INJURY) %>% summarise(length(SUBJECT_INJURY))
arrtoinj
## # A tibble: 2 x 2
##   SUBJECT_INJURY `length(SUBJECT_INJURY)`
##   <fct>                             <int>
## 1 No                                 1435
## 2 Yes                                 576

So, out of 2011 arrested criminals, 576 were injured and remaining 1435 remained uninjured.

Description of Criminal at time of arrest

library(dplyr)
library(ggplot2)

datenew1 %>% count(SUBJECT_DESCRIPTION) %>%
  ggplot(aes(x= reorder(SUBJECT_DESCRIPTION, n), y = n)) +
  geom_col(fill = "#756bb1") +
  labs(x = "Offense Type",
       y = "Number of Crimes",
       title = paste0("Offense types reported in 2016")) +
  coord_flip() +
  theme_minimal()

The graph above displays the subjects description with respect to the number of crimes committed. We can see that most of the crimes are committed by mentally unstable people and alcoholic individuals.

Reason of force used by police officer

library(dplyr)
library(ggplot2)

datenew1 %>% count(REASON_FOR_FORCE) %>%
  ggplot(aes(x= reorder(REASON_FOR_FORCE, n), y = n)) +
  geom_col(fill = "#f46d43") +
  labs(x = "Reason for force",
       y = "Number of times force used",
       title = paste0("Reason of force used")) + coord_flip() +
  theme_minimal()

And the graph above shows the number of times force is used and the reason for which force is used.

Geo-spatial data of crime locations

Lets plot the map to look the crime map of dallas.

library(leaflet)
## Warning: package 'leaflet' was built under R version 4.1.3
policing %>% 
  leaflet(.) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~LOCATION_LONGITUDE,
                   lat = ~LOCATION_LATITUDE, 
                   radius=0.01,
                   fillOpacity = 0.001)
## Warning in validateCoords(lng, lat, funcName): Data contains 55 rows with either
## missing or invalid lat/lon values and will be ignored

Lets have a look at the locations, where most of the crimes are committed.

library(ggmap)
## Warning: package 'ggmap' was built under R version 4.1.3
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
## 
## Attaching package: 'ggmap'
## The following object is masked from 'package:plotly':
## 
##     wind
sbbox <- make_bbox(lon = c(policing$LOCATION_LONGITUDE), lat = c(policing$LOCATION_LATITUDE), f = .1)
dallas = get_map(location=sbbox, zoom=10, maptype="terrain")
## Source : http://tile.stamen.com/terrain/10/236/412.png
## Source : http://tile.stamen.com/terrain/10/237/412.png
## Source : http://tile.stamen.com/terrain/10/236/413.png
## Source : http://tile.stamen.com/terrain/10/237/413.png
dallas1 = ggmap(dallas)

dallas1 +
  geom_point(data = policing, mapping = aes(x =LOCATION_LONGITUDE, y =LOCATION_LATITUDE), 
             color = "red") +
  geom_text(data = policing, 
            mapping = aes(x = LOCATION_LONGITUDE+0.1,
                          y = LOCATION_LATITUDE,
                          label = "Crime Spots"),
            size = 2, color = "gray20", 
            fontface = "bold", 
            check_overlap = T) 
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 71 rows containing missing values (geom_text).

datenew1[, 12]     <- as.numeric(datenew1[, 12])
datenew1[, 20]     <- as.numeric(datenew1[, 20])
datenew1[, 21]     <- as.numeric(datenew1[, 21])
datenew1[, 22]     <- as.numeric(datenew1[, 22])
datenew1[, 32]     <- as.numeric(datenew1[, 32])
datenew1[, 33]     <- as.numeric(datenew1[, 33])
names(datenew1)[names(datenew1) == "LOCATION_LONGITUDE"] <- "long"
names(datenew1)[names(datenew1) == "LOCATION_LATITUDE"] <- "lat"
## Assuming "long" and "lat" are longitude and latitude, respectively
## Assuming "long" and "lat" are longitude and latitude, respectively
## Assuming "long" and "lat" are longitude and latitude, respectively
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 1 rows with either
## missing or invalid lat/lon values and will be ignored
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 1 rows with either
## missing or invalid lat/lon values and will be ignored
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 3 rows with either
## missing or invalid lat/lon values and will be ignored
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 3 rows with either
## missing or invalid lat/lon values and will be ignored
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 1 rows with either
## missing or invalid lat/lon values and will be ignored
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 4 rows with either
## missing or invalid lat/lon values and will be ignored
## Assuming "long" and "lat" are longitude and latitude, respectively
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 6 rows with either
## missing or invalid lat/lon values and will be ignored
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 6 rows with either
## missing or invalid lat/lon values and will be ignored
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 6 rows with either
## missing or invalid lat/lon values and will be ignored
## Assuming "long" and "lat" are longitude and latitude, respectively
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 11 rows with either
## missing or invalid lat/lon values and will be ignored